World Happiness Report & Economic Data Analysis

Economic datasets: https://databank.worldbank.org/source/world-development-indicators

Table of Contents

  1. Introduction
  2. Loading and Combining Datasets
  3. Merging Economic Data
  4. Visualize the Data
  5. Data Cleaning and Preprocessing (Including Handling Missing Data)
  6. Feature Engineering
  7. Model Selection
  8. Model Training and Evaluation
  9. Conclusion

Introduction

The goal of this project is to analyze the World Happiness Report together with additional global economic indicators to better understand what influences happiness scores across countries.

The World Happiness Report is a landmark survey of the state of global happiness, which ranks countries by how happy their citizens perceive themselves to be.

This report is instrumental in understanding societal well-being and has been recognized by governments, organizations, and civil society across the world.

Datasets and Features

World Happiness Report

Time Range: 2015-2019

Features: 'Country', 'Year', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual'

Number of Countries: Varies year by year, around 150 countries each year

Missing Values: Some features, such as 'Trust (Government Corruption)' and 'Generosity', have missing values for certain countries in certain years. These were handled during data preprocessing.

Selected Economic Features

Time Range: Matches with the Happiness Report, i.e., 2015-2019

Features: 'Access to clean fuels and technologies for cooking (% of population)', 'Access to electricity (% of population)', 'Agriculture, forestry, and fishing, value added (% of GDP)', 'Imports of goods and services (% of GDP)', 'Industry (including construction), value added (% of GDP)', 'Manufacturing, value added (% of GDP)', 'Military expenditure (% of GDP)', 'Tax revenue (% of GDP)'

Number of Countries: Data available for most countries globally, but only the ones present in the Happiness Report were considered for the final merged dataset

Missing Values: Some indicators have missing values for certain countries and years. These were handled during data preprocessing.

Loading and Combining Datasets

Helper functions:

Merging Economic Data

Only for Israel

The whole world
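As a rough illustration of the merge step, the sketch below joins a toy happiness table to a toy economic table on country and year, then filters a single country. The long-format layout, the `Country` and `Year` join keys on the economic side, and all numeric values are assumptions made for this example; the real World Bank export may need reshaping and country-name alignment first.

```python
import pandas as pd

# Toy stand-ins for the real tables; every value here is made up.
happiness = pd.DataFrame({
    "Country": ["Israel", "Israel", "Finland"],
    "Year": [2015, 2016, 2015],
    "Happiness Score": [7.0, 7.1, 7.4],
})
economic = pd.DataFrame({
    "Country": ["Israel", "Finland", "Atlantis"],
    "Year": [2015, 2015, 2015],
    "Tax revenue (% of GDP)": [23.0, 20.0, 5.0],
})

# A left join keeps only countries present in the happiness table;
# country-years without economic data get NaN in the indicator columns.
merged = happiness.merge(economic, on=["Country", "Year"], how="left")

# Single-country view, analogous to the "Only for Israel" step.
israel_only = merged[merged["Country"] == "Israel"]
```

A left join is used here so that every happiness record survives the merge; missing indicators then surface as NaN and can be dealt with during preprocessing.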

Visualize the Data

Data Cleaning and Preprocessing

1. Deal with missing data:

2. Normalize and standardize the data:

3. Encode categorical variables:
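The three preprocessing steps above can be sketched on a toy table as follows. Median imputation, `StandardScaler`, and one-hot encoding with `pd.get_dummies` are assumptions chosen for illustration; the notebook may have used other strategies. The data values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame: column names follow the feature lists above, values are made up.
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Generosity": [0.2, np.nan, 0.4, 0.6],
    "Tax revenue (% of GDP)": [10.0, 20.0, np.nan, 30.0],
})
numeric_cols = ["Generosity", "Tax revenue (% of GDP)"]

# 1. Fill missing numeric values with each column's median (assumed strategy).
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# 2. Standardize numeric columns to zero mean and unit variance.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# 3. One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["Country"])
```

In a modeling pipeline, the imputer and scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test-set statistics.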

Feature Engineering

1. Before proceeding with feature engineering or dimensionality reduction, clean the dataset and handle any missing or problematic values:

2. Create new features that may be relevant for analysis:

3. Perform dimensionality reduction (PCA or t-SNE; PCA was selected):
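A minimal sketch of steps 2 and 3, under stated assumptions. The conclusion mentions an 'Economic Growth Rate' feature, but its definition is not given here, so the year-over-year percentage change below is only a guess. The PCA part uses synthetic data with hypothetical column names `f0` to `f4`, and the 95% variance threshold is likewise an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# --- Step 2: a hypothetical engineered feature (values are made up) ---
gdp = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2015, 2016, 2015, 2016],
    "Economy (GDP per Capita)": [1.0, 1.1, 2.0, 1.9],
}).sort_values(["Country", "Year"])
# Year-over-year change within each country; the first year has no value.
gdp["Economic Growth Rate"] = (
    gdp.groupby("Country")["Economy (GDP per Capita)"].pct_change()
)

# --- Step 3: PCA on synthetic, already-clean numeric features ---
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 5)), columns=[f"f{i}" for i in range(5)])
# Make one column nearly redundant so PCA has something to compress.
X["f4"] = 2 * X["f0"] + rng.normal(scale=0.01, size=50)

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)
# Keep the smallest number of components explaining at least 95% of variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)
```

After fitting, `pca.explained_variance_ratio_` shows how much variance each retained component carries, which helps decide whether the reduction is worthwhile.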

Model Selection

1. Split the data into training and testing sets:

2. Choose appropriate machine learning algorithms:

3. Train and evaluate the models using cross-validation and relevant performance metrics:
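The three steps above might look like the following sketch. It uses synthetic regression data from `make_regression` rather than the merged dataset, and the split ratio, fold count, and hyperparameters are assumptions. XGBoost is left out only to keep the example free of extra dependencies.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the feature matrix and happiness scores.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# 1. Hold out 20% of the rows for final testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Candidate models.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 3. 5-fold cross-validated MSE on the training split only.
cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()  # scikit-learn reports negated MSE
```

Because the synthetic target is linear in the features, linear regression is favored here; the ranking on the real data may differ.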

Model Training and Evaluation

4. Fine-tune the models using techniques like grid search or random search:

Currently for Random Forest only

TODO - Add for more models (DecisionTreeRegressor)

TODO - Add for more models (LinearRegression)
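For the Random Forest case, a grid search could be sketched as below. The parameter grid, fold count, and synthetic data are assumptions for illustration, not the values used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

# Small, hypothetical grid; real searches usually cover more values.
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_  # undo scikit-learn's sign convention
```

The same pattern should carry over to `DecisionTreeRegressor` by swapping the estimator and grid. For larger grids, `RandomizedSearchCV` samples a fixed number of combinations instead of trying them all.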

5. Select the best model based on your evaluation criteria:

Train the RandomForestRegressor on the original dataset, excluding the Happiness Rank, Economy (GDP per Capita), Happiness Score, Country, and Year columns

Show feature importances (for the Random Forest regressor only)
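A small sketch of reading feature importances from a fitted Random Forest. The data is synthetic, and the column names `strong`, `weak`, and `noise` are hypothetical, chosen so that the expected ranking is obvious.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["strong", "weak", "noise"])
# The target depends heavily on "strong", slightly on "weak", not on "noise".
y = 5.0 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
```

Impurity-based importances can be biased toward high-cardinality or correlated features, so `sklearn.inspection.permutation_importance` is a useful cross-check.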

All models:

Plot MSE and R-squared score for each model

Conclusion

The goal of our project was to model and understand the key drivers of happiness levels across various countries using different machine learning methods: Linear Regression, Decision Trees, Random Forest, and XGBoost.

When interpreting the coefficients of the Linear Regression model, it is important to remember that each coefficient represents the change in the dependent variable (happiness level) for a one-unit change in the predictor, assuming all other variables are held constant. For example, for each one-unit increase in 'Dystopia Residual', we can expect an increase in happiness level of approximately 0.62 units, with all other predictors held constant.
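The "one-unit change" reading of a coefficient can be checked numerically. The sketch below fits a linear model to synthetic, noise-free data whose first coefficient is set to 0.62 to mirror the example above; the second coefficient, the intercept, and the data are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Synthetic target: 0.62 per unit of feature 0, 0.30 per unit of feature 1.
y = 0.62 * X[:, 0] + 0.30 * X[:, 1] + 5.0

model = LinearRegression().fit(X, y)

# Raise feature 0 by one unit while holding feature 1 fixed.
base = np.array([[1.0, 2.0]])
bumped = base + np.array([[1.0, 0.0]])
delta = model.predict(bumped)[0] - model.predict(base)[0]
# delta equals model.coef_[0], the fitted coefficient of feature 0.
```

This interpretation relies on the predictors being on their original scale. If features were standardized before fitting, one unit corresponds to one standard deviation.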

By contrast, the Decision Tree, Random Forest, and XGBoost models report feature importances rather than coefficients. These indicate which variables are most influential in predicting happiness level, based on how much they contribute to the splits in the decision trees.

In this project, "Access to clean fuels and technologies for cooking (% of population)" appeared as an important predictor across all models.

Variables like "Health (Life Expectancy)", "Dystopia Residual", and "Generosity" also appeared frequently as top predictors. It is worth noting that economic variables like "Economic Growth Rate", "Military expenditure (% of GDP)", and "Tax revenue (% of GDP)" were generally found to have lower influence in these models, contrary to what one might intuitively expect.

The performance of our models was evaluated using Mean Squared Error (MSE) and R-squared statistics. A lower MSE indicates a better fit of the model to the data, while a higher R-squared value indicates that the model explains a larger proportion of the variance in the dependent variable.
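Both metrics are simple to compute by hand, which makes their meaning concrete. The helper names `mse` and `r_squared` and the toy values below are introduced only for this example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

model_mse = mse(y_true, y_pred)       # (0.25 + 0 + 0.25 + 0) / 4 = 0.125
model_r2 = r_squared(y_true, y_pred)  # 1 - 0.5 / 20 = 0.975
```

These definitions match `mean_squared_error` and `r2_score` in `sklearn.metrics`. Note that R-squared can be negative on held-out data when a model predicts worse than the mean of the targets.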

In conclusion, this project provided insightful findings on the key predictors of happiness levels across different countries. Future studies may consider exploring other machine learning algorithms, adding more variables to the models, or using different strategies for handling missing or categorical data to further improve the performance of the models.